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Abstract 

Statistical learning methodologies are nowadays applied in several sci- 
entific and industrial settings; in general, any application that strongly 
relies on data and would benefit from predictive capabilities is a candi- 
date for such approaches. In a Bayesian learning setting, the posterior 
distribution of a predictive model arises from a trade-off between its prior 
distribution and the conditional likelihood of observed data. Such distri- 
bution functions usually rely on additional hypcrparameters which need 
to be tuned in order to achieve optimum predictive performance; this 
operation can bo efficiently performed in an Empirical Baycs fashion by 
maximizing the posterior marginal likelihood of the observed data. Since 
the score function of this optimization problem is in general character- 
ized by the presence of local optima, it is necessary to resort to global 
optimization strategies, which require a large number of function eval- 
uations. Given that the evaluation is usually computationally intensive 
and badly scaled with respect to the dataset size, the maximum num- 
ber of observations that can be treated simultaneously is quite limited. 
In this paper, wc consider the case of hyperparamctcr tuning in Gaussian 
process regression. A straightforward implementation of the posterior log- 
likelihood for this model requires 0{N^) operations for every iteration of 
the optimization procedure, where N is the number of examples in the in- 
put dataset. We derive a novel set of identities that allow, after an initial 
overhead of 0{N^). the evaluation of the score function, as well as the 
Jacobian and Hessian matrices, in 0{N) operations. We prove how the 
proposed identities, that follow from the eigendecomposition of the kernel 
matrix, yield a reduction of several orders of magnitude in the compu- 
tation time for the hyperparameter optimization problem. Notably, our 
solution provides computational advantages even with respect to state of 
the art approximations that rely on sparse kernel matrices. A simulation 
study is used to validate the presented results. In conclusion, it is shown 
how the contribution of the present paper is the new state of the art for 
exact hyperparameter tuning in Gaussian process regression. 
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INTRODUCTION 



Statistical learning methodologies are nowadays applied in a wide spectrum 
of practical situations: notable examples can be easily found in economics 
DeMiguel et al. 2009), advanced manufacturing Lynn et al. 2009 , biomedical 



sciences IDreiseitl et al. 2001], robotics Vijayakumar et al. 20021, and generally 



any data-intensive application that would benefit from reliable prediction capa- 
bilities. In short, the goal of statistical learning is to characterize the behavior 
of a phenomenon from a representative training set of observations. Such char- 
acterization, referred to as a model, then allows a probabilistic characterization 
(or prediction) of unobserved data IHastie et al. 2005|. In the following, let 



5 = {Xg 



oNxP 



} 



denote a training set consisting of an input matrix, X, and a target output vec- 
tor, y. Typically, the former is associated with easily obtainable information 
(for instance, sensor readings from an industrial process operation), while the 
latter relates to more valuable data (such as a qualitative indicator of the afore- 
mentioned process, resulting from a costly procedure). The goal is then to find 
a probabilistic model /(•) such that, given a new observation x„e^ ^ S, f{Tinew) 
will yield an accurate probabilistic description of the unobserved ynew 

Here, we focus on the well known Bayesian learning paradigm, according 
to which the structure of the model / arises as a trade-off between an a priori 
distribution p{f), reflecting the prior prejudice about the model structure, and 
a likelihood distribution J3(y|/), that can be seen as a measure of the approxi- 
mation capabilities of the model [Bernardo and Smith 1994 . In more rigorous 
terms, Bayes' theorem states that 



p(/|y)«p(y|/)p(/) 

where p(/|y) is referred to as a posteriori distribution. In general the prior and 
likelihood distributions rely on a set of unknown hyperparameters 6 that must 
be estimated as part of the training process; it is well known that an optimal 
tuning of 9 is critical to achieving maximum prediction quality. It is common 
to employ non-informative hyperprior distributions for 9 (i.e., p(9) — const for 
every feasible 9) and then to maximize with respect to 9 the residual likelihood 
distribution 



piy\o) (1) 

obtained by marginalizing p{y\f,9) with respect to /. There are two major 
challenges to achieving this result: (i) score function evaluation is usually com- 
putationally demanding and badly scaled with respect to the size of S; and (ii) 



in nontrivial cases, the score function has multiple local maxima Carlin and 



Louis 1997 . While iterative global optimization techniques can be employed to 



overcome these challenges, in general, they require a large number of function 
evaluations/iterations to converge Karaboga and Basturk 2008 . It is immedi- 



ately apparent that the combination of these issues can lead to computationally 
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intractable problems for many real life datasets. Remarkably, this problem is 
so relevant that a number of approximate formulations have been researched in 
the last decade in order to reduce the computational demand: see for example 



Smola and Bartlett 2001 and Lazaro-Gredilla et al. 2010 



In this paper, a novel set of equations is developed for the efficient compu- 
tation of the marginalized log-distribution, as well as its Jacobian and Hessian 
matrices, for the specific case of Gaussian process regression. The proposed 
computation, which exploits an eigendecomposition of the Kernel matrix, has 
an initial overhead of 0{N^), after which it is possible to calculate the quanti- 
ties needed for the iterative maximization procedure with a complexity of 0{N) 
per iteration. Notably, this result makes the proposed solution amenable even 
in comparison to the aforementioned approximations. 

The remainder of the paper is organized as follows: The problem of tuning 
hyperparameters via marginal likelihood maximization in the cases of Gaussian 
process regression is introduced in Section[T] Section [2] then derives and presents 
the main results of the paper, and states the computational advantage with 
respect to the state of the art. The results are validated with the aid of a 
simulation study in Section [3j Finally, following the conclusions of the paper, 
Appendix A provides mathematical proofs of the main results. 

1 PROBLEM STATEMENT 

In this section, the hyperparameter optimization problem for Gaussian process 
regression is presented. In the following, let 

S; x) ^ (27r)"/2|S|i/2 ^^P (~^'' ~ - /^)') 

be the probability distribution function of a multivariate Gaussian random vari- 
able with expected value /i and covariance matrix S, and let 

c^-\\og\n-\{^-^^)^-\^-^i)' (2) 

be the log- likelihood of such a distribution up to an additive constant. Further- 



more, consider the following theorem that was first defined in Miller 1964 



Theorem 1.1 Let A e , a e M", B e M*^*, b G M* and Q e Let 

X G K* be an input variable. It holds that 

G(a, A; Qx)G(b, B; x) ^ G(a, A + QBQ'; b)G(d, D; x) 

with 

D = (Q'A-iQ + B-i)-i 
d = b-HDQ'A"i(a - Qb) 
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Given a training set S, let K G M.^^^ be the full kernel matrix whose [i,j] 
entry is defined as 

K[i,j] =/C(x„x,) (3) 

where JC{-,-) is a suitable positive definite kernel function and x^ is the fc-th 
row of X. Furthermore, consider a parametrization of the model / such that 

fix; c) = kic + e (4) 

where e ^ A^(0, cr^) is a noise term, kx G M^^^ is a row vector whose j-th 
element is 



kicbl = ^(x,Xj) 

and c is the unknown parameter vector. This model structure relies on RKHS 
(Reproducing Kernel Hilbert Spaces) theory, the details of which are beyond the 
scope of this paper; the interested reader is referred to [?] and Shawe- Taylor 
and Cristianini 2004 . In the following, we restrict consideration to the case 
where the observation likelihood is derived from equation Q as 

y|c^iV(Kc,a2l) (5) 
and the prior distribution of c is 

c- iV(0,A2K-i) (6) 

The hyperparameter accounts for the variability of the coefficient vector 
c, while cr^ is the output variance defined in equ ation (|4|). N otably, equations 
([5| and ^ describe Gaussian process regression Rasmussen 2004 as well as 
the Bayesian interpretation of kernel ridge regression. Applying Bayes' theorem 
yields 

p(c|y) cxp(y|c)p(c) 

and the posterior distribution of c can be readily computed using Theorem |1.1| 
as 



c|y ~ iV(Aic,Sc) 



By combining Q and ([7|, the prediction distribution for y is 
/(x; c|y) ^ 7V(kxMc, kiSekx' + a^) 



(7) 
(8) 

(9) 
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The posterior distribution of y, obtained by marginalizing with respect to c, 
reads then 



y|(a^A^)^iV(^iy,S, 



with 



My = Km, = K(K+^I) y (10) 




^ ^ -1 



Sy = KEcK + a"l = (7" ( K ( K+ ^Ij + 1 j (11) 

It is apparent that, after the marginahzation with respect to c, the posterior 
distribution p(y|((T^,A^)) depends only on the unknown hyperparameters 
and A^. The optimal values for these hyperparameters arise as the solution to 
the optimization problem 



{a^A^} = argInaxp(y|(a^A^)) (12) 



subject to the constraints 




(13) 



Equation (12 1 is usually log-transformed for numerical reasons and converted 



to a minimization problem, that is: 

{fT2,A2} = argmin (14) 



again subject to (13), where 

:=/:,(a2,A2) = -21ogp(y|(a2,A2)) 



Notably, equation ( 14 ) defines a global optimization problem whose score func- 
tion derives from equation ([2]) as 

Ly = -2 log(p(y|(a^ A^)) = log |Sy| + (^.y - - y)' (15) 

1.1 Optimization Strategies for Likelihood Maximization 

In order to minimize the nonconvex score function £y, it is necessary to resort 
to a global optimization procedure. The optimal solution is typically obtained 
in two steps: first, an approximate global minimizer is found by means of a 
global optimization procedure; notable examples include grid search strategies. 



Particle Swarm Optimization (PSO) and Genetic Algorithms Petelin et al 



2011 . Usually, such methods rely only on evaluations of the score function 
itself, and not on its derivatives, and can generally avoid being trapped in 
local minima. The approximate minimizer obtained is subsequently used as the 
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starting point for a descent algorithm (such as the Steepest Descent or Newton- 
Raphson method) that exploits the Jacobian (and possibly the Hessian) of the 
score function to converge to a local optimal solution. To summarize, 

• The global optimization step requires a large number of evaluations of the 
score function Cy in order to span the parameter space in a dense way 

• The local optimization step requires the evaluation of Cy , and its Jacobian 
(and Hessian, if needed by the method) once per iteration, but usually only 
needs a small number of iterations to converge 

In order to characterize the computational costs associated with each iteration, 
consider the identity 

(My-y) = 4(Sy-2a2l)y 



This allows Cy to be conveniently expressed as 

Cy - log |Sy| + -^y'Syy + 4y'Sy-V - 4^ (16) 



It is apparent, given equation ( 11 1, that it is necessary to compute the inverse of 
an N X N matrix in order to evaluate Cy , an operation that has a computational 
complexity of 0{N^). The Jacobian computation also has an 0{N^) bottleneck 
due to the term 

aiogiSyi / 



At this point, having both Sy and Sy^^ stored in memory, it is possible 
to compute the Hessian matrix with a complexity of 0{N'^). In conclusion, a 
computational complexity of 0{N^), along with a memory storage of 0{N'^), 
is required to perform each iteration of both the global and local optimization 
steps. This poses a severe constraint on the maximum size of dataset that 
can be employed to solve the hyperparameter optimization problem: for this 
reason, state of the art techniques for overcoming such complexity constraints 
commonly rely on sparse approximations [?]. 



2 MARGINAL LOGLIKELIHOOD VIA KER- 
NEL DECOMPOSITION 

Section [T] showed that the marginalized likelihood maximization problem for 
Gaussian process regression is characterized by local optimality and high compu- 
tational complexity. In practical applications this represents a strict constraint 
on the size of the training dataset S. In this section, a novel set of identities 
to efficiently compute the quantities involved in the optimization problem are 
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derived and discussed. Specifically, in order to perform both the global and lo- 
cal optimization steps described in the previous section, the following quantities 
need to be repeatedly computed for different parameter values: 

• Score function Cy{<j'^, A^) 

• i n-st order derivatives — " „ and — 

, , , . . a2/:„(a2,A2) d^Cy{n^,\^) B^Cy{a\\^) 

• Second order derivatives '^-r-^ , "—r— and — „ „^,„ — 

In the following, let U e M^^^ be the eigenvector matrix and S e M^^^ be 
the diagonal eigenvalues matrix such that 

USU' = K (17) 

where K is defined in equation ([s]) and let Si be the i-th ordered eigenvalue of 
K. Furthermore, let jji be the i-th element of the vector 

y = U'y (18) 

Since the kernel matrix K has, in general, full rank, the cost of its eigende- 
composition is 0{N^): this is the initial overhead associated with the proposed 
set of identities. Propositions 2.1 to 2.3 summarize the main results of the 
present contribution. Supporting proofs are included in Appendix A. 

Proposition 2.1 It holds that 

Cy = N log <7^ + Y. (log + y^9^) - 4^ (19) 

i=l 

where 

s,- 2A^s, + cr^ 



is the i-th eigenvalue o/Sy, and 
d^ + 4: : 



is the i-th eigenvalue of (cr "^Sy + 4I]y 
I 



Proposition |2.1| exploits the simultaneously diagonalizahle property of the 
following pairs of matrices: 

• K and f K + 
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• K (k + f^) ' and (^K (k + f^) ' + I 

to compute Cy as a function of the eigenvalues of K with a complexity of 0{N). 
Proposition 2.2 It holds that 



dLy N v'y f d log di >9 9% \ 

1=1 ^ ^ 

^ A / aiogrf, -2%\ /on 

^ I 5A2 ^ 5A2 ; ^ ^ 

z=l ^ ^ 



5 log 1 1 



9(72 CT2 + 2A2si ct2 + ;^2s. 

9 log di Si 0-2 



aA2 (CT2 + A2si) (a2 + 2A2si) 



(22) 
(23) 



and 



9(72 

aA2 



O \4 „ 2 ^4 



a4 (a2 + A2 Si)2(a2+2A2 s^)' 



Si 



4 Si 



(<72 + A2sO' (a2 + 2A2si)' 



(24) 
(25) 



Proposition 2.3 



52£, 
52A2 


^ / 

= S 

i=l ^ 


' 92 log di 
92 A2 


+ y^ 




^ / 
1=1 


' 92 log di 


+ yf 




. 9A29a2 


92(72 


N 


-8^4 
(76 


N 

-E 



92 A2 J 

92A29t72 

92 log d, 
920-2 



~2 9^gi 



(26) 

(27) 

(28) 
(29) 



with 
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^2 log d. 








4s,2 


32 A2 


(a2 


+ A2,s,)' 


(a2 


+ 2A2 s,)2 


92 log d. 




Si 




2 s, 




(a2 


+ A2 5,)^ 


(a2 


+ 2A2 s,f 


92 log d. 




1 




1 


92^2 


(a2 


+ A2 S,f 


(a2 


+ 2A2 s,)' 



(30) 
(31) 
(32) 



and 



52A2 

92(72 



16 s,; 



2S,; 



(cr2 + 2 A2 s, 

8s, 

{<T^ + 2X^Sif {a 



(a2 + A2 s,r 
2 s, 
A2 s,)3 



a6 (ct2 + a2 S,)^((72 + 2A2 S,)= 



(33) 
(34) 
(35) 



Propositions |2 . 2| and |2 .3| apply standard calculus rules to compute the Jacobian 
and Hessian matrices of £y with a complexity of 0{N); it should be noted 
that all the presented quantities are also valid when K does not have full rank. 



Versions of Equations ([19|), (|21j), (l27|) and (128j) that are expressed 

directly in terms of the eigenvalues Si are included in Appendix A. Remarkably, 
by employing the proposed set of identities, it is possible to compute the current 
score function, Jacobian vector and Hessian matrix in 0{N) for every iteration 
of an optimization algorithm, following an initial overhead of 0{N^) for the 
eigendecomposition of K. 

The convenient properties of the proposed set of equations allow to assess 
the uncertainty of the estimated model, described by equation ([9| , with reduced 
computational complexity: this result, whose proof is not reported because of 
its simplicity, is summarized in the following 



Proposition 2.4 It is apparent that, in order to compute the quantity 

„2 X -1 



Var[c|y] = Sc = a" K 



A2 



K 



(36) 



two N X N matrices need to be inverted; this results in a cost of 0(N^). By 
exploiting again the SVD oj^, it follows that 



Sc = UQU' 

where Q is a diagonal matrix whose i-th element is 



(37) 
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* ~ (A2.S, + a^)s. 

It follows that the matrix Sc ca n he computed u sing Strassen's algorithm in 
C)(Ariog2 7) ^ 0(^2.807) operations 'Strassen ^ 1969 1 In most cases, only a part 



of the uncertainty matrix is of interest (for example, only the diagonal elements) . 
It is immediate to notice that, using equation \31^ , it is possible to compute 
directly only the interesting elements (inO{N) each), where equation l3(^ would 
still require matrix inversion operations of the usual 0{N^) complexity. | 



2.1 Proof of Computational Advantage 

In order to characterize the advantage that the new formulation brings to the 
problem at hand, we consider an iterative global optimization technique that is 
able to solve problem (14) in k* steps. As discussed in Section [lT] when the 
new set of identities is not employed the solution of ( 14 ) costs 



To = k*0{N^) 

In contrast, when the new set of identities is used, the cost is instead 

Tl = 0{N^)^k*0{N) 
Provided the number of examples, N , is such that ^ N , it follows that 



To 



= 0(fc*) 



(38) 



Hence, the eigendecomposition formulation results in a speed-up in the solution 
of ( 14 1 by a factor k* . The number of iterations k* depends, of course, on 



the characteristics of S, the optimization algorithm employed and the stopping 
criteria selected. In practice it is common to encounter problems where the 
value of k* is in the hundreds. Note that there exists an upper bound on the 
achievable speed-up: 

To 



(39) 

For reasonably sized dataset this upper limit has little practical implication (for 



lim 



N = 200, equation (39) would only be relevant if the number of steps needed, 
k* , is of the order of 40000) . By combining ( 38 ) and ( 39 ) , the final speed-up 
order reads 



To 
Tl 



= C'(niin{A:*, iV^}) 



(40) 



In order to practically implement the proposed set of equations, memory 
storage of 0{N) is required: specifically, all the quantities defined in Propo- 
sition |2.1[ |2.2| and |2.3| are computable as a function of the eigenvalues Si and 
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the projected target values iji (since, following from the properties of Singu- 
lar Value Decomposition, y'y = y'y)- This represents an additional advantage 
with respect to the procedure described in Section [2] which has O(iV^) storage 
requirements. In conclusion, it has been proven that the proposed methodol- 
ogy is able to speed-up hyperparameter tuning operations by many orders of 
magnitude (up to a factor of N'^) and can be implemented with much lower 
memory requirements than conventional approaches. Furthermore, in the case 
of multiple-output training datasets, such that 



the proposed technique has the advantage that the eigendecomposition need 
only be computed once (since K depends only on X), to solve the M tuning 
problems: no additional computational overhead is needed for multiple output- 
problems. Since state of the art sparse approximations have a computational 
complexity of 0{Nvn?) per evaluation, where vn? is the number of non-zero 
elements of the approximation of K [?] , it is apparent that the proposed set of 
identities provides a speed-up of hyperparameter tuning even with respect to 
approximate methods, at least if k* exceeds a certain threshold that depends 
on the sparsity rate m/N. 

2.2 Kernel Hyperparameters Tuning 

It is quite common for the kernel function K. to depend on additional hyperpa- 
rameters, such that 



Notable examples of such parameters are the bandwidth parameter of the 
Radial Basis Function kernel 



Since the matrix K depends on 9, it is in general necessary to recompute the 
eigendecomposition every time a better value of 6 is found. It is possible, how- 
ever, to exploit the convenient features of the proposed calculations with an 
efficient two-step strategy, summarized in the following algorithm. 



5 = {X,yi, . . . , ym} 



JC{x,y) := IC{x,y; 9) 




and the degree I of the Polynomial kernel 



^(a;,y) = {{x,y} + 1) 
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Algorithm 1: Two-step procedure for additional hyperparam- 
eter tuning 

Let 9 be the additional hyperparameter vector, and let (7o(-) and 
gi{-) be implementations of iterative techniques such that 



.91 (AD 



where k is the current iteration number. Given suitable ini- 
tial values 9o, CTq and Ag, consider the following optimization 
procedure: 

1. For fc = 0, . . .,krnax 

(a) Ok+i = .90 (4) 

(b) For j 0, . . . Jmax 

ii. >^l+i=9iCK) 

iii. Stop if criterion is met 

(c) End for 

(d) Stop if criterion is met 

2. End for 



It is then possible to reduce the number of required 0{N^) operations by 
implementing an efficient {0{N)) internal loop: this allows a conventional line 
search to be performed on the "expensive" hyperparameter, while solving the 
inner loop problems with a much higher efficiency. 



3 SIMULATION RESULTS 

In order to precisely assess the computational complexity of evaluating the pro- 
posed set of identities, a simulation study was conducted for a range of datasets 
of different sizes. For the sake of reproducibility, we briefly describe the sim- 
ulation environment: the results were obtained using the MATLAB R2010a 
numerical engine, running on an Intel(R) Core(TM)2 Quad CPU Q9550 with a 
clock speed of 2.83GHz, 8 GB of RAM memory and running a 64 bit edition of 



Windows 7. Given that Propositions 2.1 to ^predict that Ti(iV) = 0{N), the 



goal is to estimate the coefficients of the linear models 

t(A) = a + bN 
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Figure 1: Simulation results for the evaluation of the score function L. 



for the computational complexity of the score function evaluation ( 19 ), Jacobian 



evaluation (20 and 21) and Hessian evaluation (26 27 and 28). In order to 
obtain a representative dataset, the average execution time (on 10^ iterations) 
of these quantities was evaluated for values of A'' ranging from 32 to 8192 on a 
logarithmic scale. 

Figure [T] shows the simulation results obtained for the evaluation of the 
proposed score function equation (19). Here, the x-axis is the number of dat- 



apoints, iV, in the dataset and the y-axis is the average computation time (in 
microseconds). The estimated complexity function is 



tl{N) ~ 42.26 + 0.05iV [^s] 



(41) 



Remarkably, the computational overhead for evaluating Hy is only 0.05 microsec- 
onds per observation: This is especially relevant for the global optimization step, 
which involves a large number of such evaluations. 

Figure [2] shows the simulation results for the estimation of the computational 
complexity of equations (l20|) and (21). The estimated complexity is 



rj(iV) ~ 44.54 + 0.0867V H 



(42) 



As two values need to be computed to build the Jacobian, it is not surprising 
that the slope coefficient is about twice that of the score function evaluation 



(equation (41 )). 
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Figure 2: Simulation results for the evaluation of the Jacobian of the score 
function £„ 



The Hessian evaluation, arising from equations (|26[), (27 1 and (28), yields 



somewhat surprising results. As shown in Figure [3] the piecewise linear model 

(43) 



THiN) 



J 64.04 + 1.39iV [fis] N < 1024 
[ 1347.81 + 0.137V [us] N > 1024 



was needed in order to fit the data. Surprisingly, a large reduction (about an 
order of magnitude) in the slope is observed for N > 1024. As this feature 
cannot be linked to the theoretical formulation of Proposition |2.3[ the authors 
conjecture that it relates to internal procedures of the MATLAB numerical 
engine. Indeed, the simulation experiment was conducted on several computers 
yielding similar results. It is interesting to observe that the slope of th for 
N > 1024 is about one and half times the slope of tj and three times the slope 
of tl'. this is consistent with the number of unique quantities needed for the 
Hessian computation. 

These simulation results allow the amount of time needed to evaluate all the 
quantities of the optimization to be estimated. Following on from equations 



(41), (42) and (43) (with N > 1024), and assuming a local descent algorithm 
that makes use of Hessian information, the per iteration computational time for 
the local optimization step is given by 



TLciN) ~ 1434.61 + 0.266iV [^s] 



(44) 
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Figure 3: Simulation results for the evaluation of the Hessian of the score func- 
tion 



while the per iteration computation time for the global optimization step, which 
depends only on the evaluation of Ly is given by 



TGc(A^) - 44.54 + 0.086A^ \[is\ 



(45) 



Thus for a dataset with N = 8000 datapoints, for example, the local opti- 
mization step computation time is only 3.56 milliseconds per iteration while that 
of the global optimization step is a mere 440 microseconds. These numbers are 
even more impressive when one realizes that without the new set of identities, 
an optimization problem of this size would normally be considered intractable 



Rasmussen and Kuss 2004 



CONCLUSIONS 

In practical statistical learning applications, hyperparameter tuning plays a key 
role in achieving the desired prediction capabilities. Such tuning is usually per- 
formed by means of maximization of marginalized posterior likelihood, which 
presents two main challenges: (i) being a nonconvex optimization problem with 
local optimal points, global optimization techniques (that are usually demanding 
in terms of number of function evaluation) must be used, and (ii) such evalua- 
tions are usually computationally intensive and scale badly with the number of 
examples in the training dataset. In the cases considered in this paper, namely 
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kernel ridge regression and Gaussian process regression techniques, evaluating 
the score function has a complexity of 0{N^): as a consequence, it is often im- 
practical to tune the hyperparameters using a dataset of sufficient dimensions. 

In this paper, a set of new identities derived from a spectral decomposi- 
tion of the kernel matrix are employed to reduce drastically such complexity. 
Specifically, after the initial decomposition (that costs 0{N^)), all the quantities 
involved in the problem are computable in 0{N). This represents an advantage 
of several orders of magnitude with respect to the state of the art for exact so- 
lutions: specifically, it allows the hyperparameter tuning problem to be solved 
with a speed-up factor 0{k*), where k* is the number of required iterations. 
Furthermore, the required memory storage for the new equations is 0{N), as 
opposed to 0{N^). It is then possible to use much larger datasets for hyperpa- 
rameter tuning purposes, without the need to resort to sparse approximations. 
A two-stage procedure has also been proposed and discussed, which enables the 
efficiencies offered by the new identities to be exploited when additional hyper- 
parameters have to be optimized. Finally simulation results are presented that 
verify the computational advantages of the new formulation. 

APPENDIX A 

Let us consider the following 

Theorem 3.1 Let A e M^''^ and B e R^""^ . If A and B commute, such 
that 

AB = BA 

A and B are simultaneously diagonalizable. This means that, given an cigen- 
decomposition of A such that 

UDaU' = A 

it holds that 

UDbU' = B 

where Da and Db are the diagonal eigenvalues matrices associated to A and 
B. I 

Proof of Proposition |2.1[ Considering the identity 

(My-y) = ^(Sy-2a2l)y 
it follows that, up to an additive constant, 

Cy = log + ;^y'Syy + 4y'Sy-V - 4^ (46) 
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In order to efficiently compute Cy and its derivatives with respect to ct^ and A^, 
consider the eigendecomposition of K such that 

K = USU' 

where U is an orthogonal matrix such that UU' = I and S is a diagonal matrix 
whose z-th entry Si is the i-th ordered eigenvalue of K. It is easy to prove that 
K and (K + ^al)"^ commute, and hence 



U(S(S + ^I) ')u' = K(K+^I)- 



Following from (111, 

CT^UDU' = Sy 

where the i-th entry of the diagonal matrix D is 

_ Si _ 2\^s^ + 

St + j2 A Si+ a 

Recalling that 



where A is a square matrix and is its i-th eigenvalue, it follows that 

N N 

log|Sy| = ^log(a2d,) = TVlogd^ + ^logd, 

i=l i=l 



After this substitution, equation (46) reads 

Cy^N log a^ + Y. i°g '^^ + -^y'^yy + 4y'Sy- V - 4^ 

(J a 

i=i 

Since Sy and Sy commute, they are simultaneously diagonalizable. Letting 

y = u'y 

and letting yj be the j-th element of y. 



N 

yy 

2 



Cy^N log + ^' + y'UGU'y - 4' 



a 

1=1 

where G is a diagonal matrix whose i-th entry is 

£+4 SX-^s,^ + 12 X^s^a^ + 5 



a'^d, [a^ + X^ s^) {a^ + 2 X^ s,) 
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The final form of Cy is then 



N 



y'y 



(47) 



2 .yy 



log cr^ -4 



AT 



i=l 



8A4 5^2 + 12 Si 0-2 + 5 cr* 



Proof of Proposition 2.2 By applying derivative rules to equation (47 1, 



dCy 

dCy 

aA2 



N 



'4y[ A4 (3 ,S,2 _ 2 ,<;,,2 ^4 -2^ ^ ^8 ^2 ^ ^2 ^8 ^ 2 A^ ^4 
^ cr4(^2 + _X2s,)^(cr2 + 2A2.S,)^ 



E 

4=1 



2 A4 ,,3 + (4 ^2 ^2 _ 3 ^4) ^2 ^^2 ^ (3 ^4 -2 _ ^6) - 
(2A4 S,2 + 3A2 S,cr2 +^4)2 



Proof of Proposition 2.3 By applying derivative rules to equation (47 1, 



92 A2 
aA25a2 



92 (72 



! (t2 A^ S,5 + (24 a2 yf _ Ig + (35 ^4 ~2 _ ^3 ^6) ^2 ..^3 + (14 ^6 y2 _ 3 ^8) 



Af 

E 

-'V / ^ \S „ 5 , z'^? ^2 



E 

1=1 



4A« Si^ + (60-2 



(2 A4 s,2 + 3A2 a2 + (j42^ 
8 y2) A6 + (12 a4 _ 3 ^6) ^2 ,^2 + ^6 ^2 „ ^8) 



(2A4 s,2 + 3A2 a'^ + a^y 



N 



i=l 



with 



8 y2 s,2 (9 a4 ffi" - 12 A4 yj) + .s,3 (13 A^ a» - 12 A^ yf) + 2 ai2 y2 + 2 A2 ^^2 + g A^ s,^ 

^^72A4^7+3A2^7CT2T^T^f 
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